Insights into modeling refractive index of ionic liquids using chemical structure-based machine learning methods

Ionic liquids (ILs) have drawn much attention due to their extensive applications and environment-friendly nature. Refractive index prediction is valuable for IL quality control and property characterization. This paper aims to predict the refractive indices of pure ILs and identify the factors influencing refractive index changes. Six chemical structure-based machine learning models, namely eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Categorical Boosting (CatBoost), Convolutional Neural Network (CNN), Adaptive Boosting-Decision Tree (Ada-DT), and Adaptive Boosting-Support Vector Machine (Ada-SVM), were developed to achieve this goal. A large dataset containing 6098 data points for 483 different ILs was used to train the models. The chemical substructures, temperature, and wavelength of each data point served as model inputs; to our knowledge, wavelength has not previously been used as an input in machine learning predictions of this property. The results show that the best model was CatBoost, followed by XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM. The R² and average absolute percent relative error (AAPRE) of the best model were 0.9973 and 0.0545, respectively. Compared with the literature, this study's models offer two advantages, namely the abundance of the dataset and the prediction accuracy. This study also reveals that, among all inputs, the presence of the -F substructure in an ionic liquid has the greatest influence on its refractive index. It was also found that the refractive index of imidazolium-based ILs increases with increasing alkyl chain length. In conclusion, chemical structure-based machine learning methods provide promising insights into predicting the refractive index of ILs in terms of accuracy and comprehensiveness.

In the most extensive ML study to date, Baskin et al.39 represented the ILs with molecular descriptors. They first predicted the refractive indices at a single temperature and then repeated the estimation across eight different temperatures. They concluded that the best ML methods for the one-temperature and eight-temperature investigations were ASNN and DNN, respectively, and that the best representation for their QSPR model was CDK23 in both cases. The statistical outcome of their research was an R² of 0.86, an RMSE of 0.016, and an MAE of 0.0081 for the single temperature, and an R² of 0.922, an RMSE of 0.0112, and an MAE of 0.00725 when the temperature was varied in the specified range. As the literature review illustrates, the studies that used methods other than machine learning incorporated few ILs and data points for modeling, whereas the studies that developed machine learning models with many data points achieved limited accuracy. Further, using wavelength as an input in machine learning models has not yet been investigated. This study uses six robust chemical structure-based machine learning models, namely XGBoost, LightGBM, CatBoost, CNN, Ada-DT, and Ada-SVM, to predict the refractive indices of an extensive database, with the temperature, wavelength, and chemical substructures of each data point as inputs. The database contains 6098 data points from 483 ILs. In addition, this study investigates how each chemical substructure, temperature, and wavelength affects the refractive index. The best model is identified through statistical analysis and graphical representation.
Data collection. The dataset used in this study was obtained from the NIST Ionic Liquids Database (SRD#147 v2.0)40,41. The extracted data comprise 6098 data points belonging to 483 different pure ILs. The temperatures at which these ILs were tested varied from 278.15 to 368.1 K, the wavelengths ranged from 430.1 to 822.7 nm, and the measured refractive indices ranged from 1.335 to 1.700. Additionally, the molecular weights of the ILs used in this study range from 77.08 to 866.64 g/mol. Brief statistics of the data are presented in Table 1. Table S1 in the Supplementary Information lists all the ILs used in the present study with the ranges of their temperature, wavelength, and refractive index, as well as the number of corresponding data points.
Graphical demonstrations of the dispersion of the database with respect to temperature and wavelength are presented in Figs. 1 and 2. Figure 1 shows that most data points lie between 298.14 K and 303.14 K. Moreover, as shown in Fig. 2, the majority of the measurements were taken at a wavelength of 589.3 nm, the sodium D line. The scarcity of other wavelengths should not be interpreted as meaning that this parameter plays a negligible role in determining the refractive index. As the literature states, pressure, temperature, composition, and the wavelength of the light source are the variables that correlate with the refractive indices of liquid mixtures42. Cauchy's equation and the Sellmeier equation, used by Guo et al.43 and Arosa et al.1, respectively, serve the purpose of describing this wavelength dependence.

Modeling
Modeling procedure. The process of modeling, starting with data gathering and ending with results analysis, is shown in the flowchart in Fig. 4. The models used in this study are CatBoost, XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM. The following sections present detailed descriptions of the models, their hyperparameters, and their inputs.

Model development. XGBoost. XGBoost is a scalable tree boosting system and one of the most successful and widely used machine learning methods. The model class underlying XGBoost is the "decision tree ensemble", which comprises a set of classification and regression trees (CARTs). Because a single tree is not expressive enough in practice, an ensemble model that sums the predictions of multiple trees is commonly used44. The model can be written as Eq. (1):

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \tag{1}$$

where K is the number of trees, $f_k$ is a function in the regression tree space $\mathcal{F}$, and $\mathcal{F}$ is the set of all possible CARTs. To train the model, the objective function is defined and minimized:
$$\mathcal{L} = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega\left(f_k\right), \quad \Omega\left(f_k\right) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2} \tag{2}$$

where T is the number of leaves and w is the vector of leaf weights. The first term of Eq. (2) is the training loss, and the second is the regularization term, which controls the complexity of the model and is necessary to avoid overfitting. Since it is intractable to learn all the trees at once, XGBoost uses an additive training strategy: the prediction for the i-th instance at the t-th iteration is substituted into the objective function.
By adding one new tree at a time, new predictions are generated step by step44.
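As an illustration of this additive ensemble, the minimal sketch below fits an XGBoost regressor to synthetic placeholder data; the feature matrix, target, and hyperparameter values are illustrative assumptions, not the configuration tuned in this study (see Table 3).

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the real inputs: temperature, wavelength and
# substructure counts as features, refractive index as the target.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 1.4 + 0.1 * X[:, 0] + 0.01 * rng.standard_normal(200)

model = xgb.XGBRegressor(
    n_estimators=300,   # K, the number of trees f_k in Eq. (1)
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,     # lambda in the regularization term of Eq. (2)
    gamma=0.0,          # gamma, the per-leaf penalty in Eq. (2)
)
model.fit(X, y)
print(model.predict(X[:5]))
```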
LightGBM. Another algorithm that, like XGBoost, uses the GBDT framework is LightGBM. The primary purpose of this method is to increase computational efficiency so that the prediction problem can be solved more easily45. LightGBM has two features that make it more cost-effective: a histogram-based decision tree algorithm and a leaf-wise growth strategy. Unlike XGBoost and many other boosting tools, which use pre-sort-based algorithms for decision tree learning, LightGBM uses histogram-based algorithms. In a histogram-based decision tree algorithm, floating-point feature values are discretized into bins that are used to construct the histogram. Once the histogram has accumulated the gradients and sample counts within each bin, the optimal split point can be found using the discrete values of the histogram. As shown in Fig. 5, in a level-wise growth approach, the leaves on each layer are split at the same time. This strategy is inefficient in terms of memory consumption because many leaves have low information gain, and it is unnecessary to search and split them. The leaf-wise growth approach instead splits only the leaf with the largest information gain on each tree layer. This strategy reduces memory usage and speeds up training45,46.
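A minimal LightGBM counterpart is sketched below, with the two features highlighted in the text surfaced as hyperparameters; the values are illustrative, not the tuned settings from Table 3.

```python
import numpy as np
import lightgbm as lgb

# Placeholder feature matrix and target, as in the XGBoost sketch above.
rng = np.random.default_rng(1)
X = rng.random((200, 5))
y = 1.4 + 0.1 * X[:, 1]

model = lgb.LGBMRegressor(
    n_estimators=300,
    max_bin=255,       # bins for histogram-based split finding
    num_leaves=31,     # cap on leaves grown leaf-wise (best gain first)
    learning_rate=0.05,
)
model.fit(X, y)
```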
CatBoost. CatBoost is a gradient boosting method whose primary aim is to decrease the prediction shift during training47. The prediction shift arises from a particular type of target leakage present in all implementations of gradient boosting algorithms48. One benefit of CatBoost is an innovative algorithm that converts categorical features into numerical ones. It also combines categorical features to exploit the connections between them, enriching the feature dimensions. In addition, it employs a symmetric tree model to mitigate overfitting, making the algorithm more accurate and generalizable49.
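The sketch below shows how a CatBoost regressor might be instantiated; ordered boosting is what the library uses to counter prediction shift, and symmetric (oblivious) trees are its default structure. The parameter values are placeholders, not this study's tuned settings.

```python
import numpy as np
from catboost import CatBoostRegressor

# Placeholder feature matrix and target.
rng = np.random.default_rng(2)
X = rng.random((200, 5))
y = 1.4 + 0.1 * X[:, 2]

# boosting_type="Ordered" makes the ordered-boosting scheme explicit;
# symmetric trees are CatBoost's default tree structure.
model = CatBoostRegressor(iterations=300, depth=6, learning_rate=0.1,
                          boosting_type="Ordered", verbose=False)
model.fit(X, y)
```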
Convolutional neural network. CNN architectures appear in the literature in a wide variety of forms; however, they are all based on the same fundamental principles. A typical CNN comprises three kinds of layers besides the input and output layers: convolutional layers, pooling layers, and fully-connected layers. The purpose of a convolutional layer is to learn characteristic representations of the inputs. The pooling layer achieves shift-invariance by reducing the resolution of the feature maps. Several convolutional and pooling layers may be followed by one or more fully-connected layers, and the final layer of a CNN is the output layer. The optimal parameters for a given task are obtained by minimizing a loss function, defined as follows50:

$$L = \frac{1}{N}\sum_{n=1}^{N} \ell\left(\theta;\, y^{(n)},\, o^{(n)}\right)$$

where N is the number of desired input-output relations, θ denotes all the parameters of the CNN, $y^{(n)}$ is the target label corresponding to the n-th data point, and $o^{(n)}$ is the output of the CNN. The best-fitting parameters are obtained by minimizing L; in effect, training a CNN is a global optimization problem50.
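The sketch below assembles a small 1-D CNN of this kind in Keras and minimizes a mean-squared-error loss, one instance of the L(θ) above. The layer sizes and the 38-feature input shape (36 substructures plus temperature and wavelength) are assumptions for illustration, not the architecture used in this study.

```python
import numpy as np
from tensorflow import keras

# Placeholder input reshaped to (length, channels) for 1-D convolution.
rng = np.random.default_rng(3)
X = rng.random((200, 38, 1))
y = 1.4 + 0.1 * X[:, 0, 0]

model = keras.Sequential([
    keras.layers.Conv1D(16, kernel_size=3, activation="relu",
                        input_shape=(38, 1)),    # convolutional layer
    keras.layers.MaxPooling1D(pool_size=2),      # pooling layer
    keras.layers.Flatten(),
    keras.layers.Dense(32, activation="relu"),   # fully-connected layer
    keras.layers.Dense(1),                       # output: refractive index
])
model.compile(optimizer="adam", loss="mse")      # L(theta) to be minimized
model.fit(X, y, epochs=5, verbose=0)
```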
AdaBoost. Generally, boosting is a solid approach for increasing the predictive power and accuracy of regression and classification models. AdaBoost is a boosting algorithm that performs the following four steps51 (see the sketch after this list):

1. The input, consisting of the number of cycles, a learning algorithm, and a set of training samples, is given to the AdaBoost algorithm.
2. AdaBoost assigns identical weights to all training samples.
3. It calls the learning algorithm to train a classifier on the weighted training samples and calculates the error. It then sets the weight of the component classifier and updates the weights of the training samples over a defined number of loops.
4. The procedure advances through the specified cycles, and finally AdaBoost linearly combines all the component classifiers into a single output.
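A minimal Ada-DT sketch following these steps uses scikit-learn's AdaBoost with a decision tree weak learner; the tree depth and cycle count are illustrative, not this study's tuned values.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# Placeholder feature matrix and target.
rng = np.random.default_rng(4)
X = rng.random((200, 5))
y = 1.4 + 0.1 * X[:, 3]

ada_dt = AdaBoostRegressor(
    # Weak learner (use base_estimator= on scikit-learn < 1.2).
    estimator=DecisionTreeRegressor(max_depth=4),
    n_estimators=100,   # number of boosting cycles (step 1)
    learning_rate=0.5,  # shrinks each component learner's contribution
)
ada_dt.fit(X, y)        # sample weights are updated internally each cycle
```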
Decision tree. Decision trees work by splitting a dataset sequentially into smaller segments until the target variables match or until the dataset cannot be divided any further. The algorithm is greedy because it makes the best decision at the current step without considering global optimality52. There are different types of DT algorithms, but all of them share a similar structure, explained in the following steps53:

1. Assign every training instance to the root of the tree and set the root as the current node.
2. Find the split feature and value according to the split criterion, which might be the Gini coefficient, information gain, or information gain ratio.
3. Use the split feature and threshold value to divide all data points in each node.
4. Designate all partitions of the current node as child nodes.
5. For child nodes containing instances of only one class, tag the node as a leaf and return; otherwise, set the node as the current node and return to step 2.
Support vector machine. The Support Vector Machine is another well-known supervised learning algorithm that can be used for both regression and classification problems. The algorithm plots each data point as a single point in an n-dimensional space, where n is the number of inputs. The goal of SVM is to find the best line separating the n-dimensional space into discrete classes, so that new data points can later be placed in the appropriate category. This line is called a hyperplane; the farther it lies from the points of any class, the better the separation. Many hyperplanes may separate the data, but the best is the one with the largest margin between the two classes. The points closest to the hyperplane are called support vectors. Figure 6 shows two separating hyperplanes with small and maximal margins (H2 and H3) and one that fails to separate the classes correctly (H1). The decision function of the SVM algorithm is51:

$$f(x) = w^{T}\phi(x) + b$$

where b is the bias term, φ(x) is a mapping of x from the input space to the n-dimensional feature space, and w is the weight vector. To obtain the optimal values of w and b, the following optimization problem has to be solved:

$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\left(w^{T}\phi(x_i) + b\right) \ge 1 - \xi_i, \;\; \xi_i \ge 0$$

where C is the regularization parameter and $\xi_i$ is the i-th slack variable51.
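An Ada-SVM analogue can be sketched by boosting an RBF-kernel support vector regressor; C below is the regularization parameter from the optimization problem above, and all values are illustrative placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.svm import SVR

# Placeholder feature matrix and target.
rng = np.random.default_rng(5)
X = rng.random((200, 5))
y = 1.4 + 0.1 * X[:, 4]

ada_svm = AdaBoostRegressor(
    # SVR as the weak learner; epsilon controls the slack tolerance.
    estimator=SVR(kernel="rbf", C=10.0, epsilon=0.001),
    n_estimators=50,
)
ada_svm.fit(X, y)
```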

Hyperparameter optimization.
Adapting a machine learning model to different problems requires tuning its hyperparameters. Choosing an appropriate hyperparameter configuration is therefore a crucial step in developing machine learning models, as it directly affects their performance54. Table 3 lists all the tuned hyperparameters together with their search ranges; one possible tuning loop is sketched below.
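The snippet below runs a small cross-validated grid search over two CatBoost hyperparameters as an example of such tuning; the grid is hypothetical, and the actual hyperparameters and ranges examined in this study are those in Table 3.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostRegressor

# Placeholder feature matrix and target.
rng = np.random.default_rng(6)
X = rng.random((200, 5))
y = 1.4 + 0.1 * X[:, 0]

# Hypothetical search grid; CatBoost exposes the scikit-learn interface,
# so GridSearchCV can tune it directly.
param_grid = {"depth": [4, 6, 8], "learning_rate": [0.03, 0.1, 0.3]}
search = GridSearchCV(
    CatBoostRegressor(iterations=200, verbose=False),
    param_grid, cv=5, scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```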
Input parameters. The independent inputs of our models are the temperature, wavelength, and chemical substructures of the ILs. Temperature has a well-known influence on the refractive index and is therefore included in the model inputs37. Wavelength, however, has not previously been considered as an input to machine learning models of this property. Numerous studies have focused on determining the refractive index at a single wavelength, the sodium D line at 589.3 nm. If the wavelength is excluded from modeling, a material's chromatic dispersion, i.e., the variation of the refractive index over a range of wavelengths, is left out of the study. Furthermore, considering the wavelength as an input provides much information about the chemical composition and physical properties1. While dispersion is minimized in some applications (for instance, optical communication and imaging systems), it benefits others (for instance, dispersive prisms in laser cavities, compensation of dispersion introduced by other optical components, or optical spectrometers). In both cases, the refractive index dispersion of an optical device must be accurately characterized to ensure optimal performance55.
The chemical substructures of an ionic liquid were also used as model inputs, similar to the approach proposed by Valderrama et al.56. The chemical substructures used in this study are listed in Table 4, and Fig. 7 shows an example of how an ionic liquid is fragmented into its substructures. In Fig. 7, the cation of the presented ionic liquid is fragmented into two -CH3, one [>N=]+ (with rings), three =CH- (with rings), one >N- (with rings), and one -CH2- substructures. Likewise, the anion is fragmented into four -F and one -B substructures. Pressure was not included in the inputs because pressure changes have little effect on the refractive index57. A possible encoding of this fragmentation as a model input vector is sketched below.
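The sketch below encodes the Fig. 7 fragmentation as a feature vector; the substructure labels, dictionary layout, and example temperature and wavelength values are illustrative assumptions.

```python
# Substructure counts read off the Fig. 7 example (cation + anion).
substructure_counts = {
    "-CH3": 2,
    "[>N=]+ (with rings)": 1,
    "=CH- (with rings)": 3,
    ">N- (with rings)": 1,
    "-CH2-": 1,
    "-F": 4,
    "-B": 1,
}

# Full input vector: one count per substructure in Table 4 (zeros for the
# substructures this IL lacks), plus temperature (K) and wavelength (nm).
all_substructures = list(substructure_counts)  # in practice, all 36 of Table 4
features = [substructure_counts.get(s, 0) for s in all_substructures]
features += [298.15, 589.3]   # example temperature and wavelength
```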

Assessment of models
Statistical assessment. It is crucial to evaluate the proposed models' accuracy through statistical and graphical analysis of the results. Statistical model performance is assessed with the following metrics: average absolute percent relative error (AAPRE), coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE):

$$\text{AAPRE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{x_{i,\text{exp}} - x_{i,\text{pred}}}{x_{i,\text{exp}}}\right|$$

$$R^{2} = 1 - \frac{\sum_{i=1}^{N}\left(x_{i,\text{exp}} - x_{i,\text{pred}}\right)^{2}}{\sum_{i=1}^{N}\left(x_{i,\text{exp}} - \bar{x}_{\text{exp}}\right)^{2}}$$

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i,\text{exp}} - x_{i,\text{pred}}\right)^{2}}$$

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|x_{i,\text{exp}} - x_{i,\text{pred}}\right|$$

where N is the number of data points, $x_{i,\text{exp}}$ is the i-th experimental value of the refractive index, $x_{i,\text{pred}}$ is the i-th predicted value of the refractive index, and $\bar{x}_{\text{exp}}$ is the average of the experimental refractive index values.

Table 4. A set of 36 chemical substructures utilized in this study.
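These four metrics translate directly into code; the minimal helper below mirrors the definitions above.

```python
import numpy as np

def evaluate(x_exp, x_pred):
    """Return AAPRE (%), R^2, RMSE and MAE, as defined above."""
    x_exp = np.asarray(x_exp, dtype=float)
    x_pred = np.asarray(x_pred, dtype=float)
    aapre = 100.0 * np.mean(np.abs((x_exp - x_pred) / x_exp))
    r2 = 1.0 - np.sum((x_exp - x_pred) ** 2) / np.sum((x_exp - x_exp.mean()) ** 2)
    rmse = np.sqrt(np.mean((x_exp - x_pred) ** 2))
    mae = np.mean(np.abs(x_exp - x_pred))
    return aapre, r2, rmse, mae
```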
Graphical assessment. In addition to the statistical assessment, the models can be evaluated through visual plots, including cross plots, error distribution plots, box and whisker diagrams, heatmaps, and cumulative frequency plots. A cross plot shows how well the data are distributed around the ideal X = Y line, on which all points would lie if the prediction were perfect. An error distribution plot depicts the relative error versus the experimental value of the data; prediction accuracy decreases as the data deviate from the Y = 0 line.
Box and whisker diagrams illustrate how well the model predicts the refractive index across the four quartiles of a specific anionic or cationic family. Heatmaps show the distribution of certain parameters with respect to the relative error. The cumulative frequency plot shows what portion of the data is subject to less than a given absolute relative error; its x-axis is therefore the absolute relative error, defined as:
$$\text{Absolute relative error}\,(\%) = \left|\frac{x_{i,\text{exp}} - x_{i,\text{pred}}}{x_{i,\text{exp}}}\right| \times 100$$

where $x_{i,\text{exp}}$ is the i-th experimental value of the refractive index and $x_{i,\text{pred}}$ is the i-th predicted value of the refractive index.

Results and discussion
Statistical analysis. The results of this study demonstrate the effectiveness of machine learning methods in predicting the refractive indices of a wide range of ILs, which was the original purpose of this work. Six well-known machine learning models based on the chemical structures of numerous ILs were employed to this end. Although all the models performed satisfactorily, the most accurate was CatBoost, followed by XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM. Table 5 presents the statistical results of the training, testing, and overall data for each model. The results show a remarkably low error for most of the models. Among them, the CatBoost results, with an overall R², AAPRE, MAE, and RMSE of 0.9973, 0.0545, 0.0008, and 0.0021, respectively, reveal the extraordinary power of this ML model in predicting the refractive index of more than 6000 data points. The least accurate model is Ada-SVM, with an R² of 0.9227 and an AAPRE of 0.6618. A comparison between the present study and the literature is given in Table 6. According to Table 6, this research considered the most ILs for predicting the refractive index, and the number of input data points was higher than in most studies. The current study is thus comparable with that of Baskin et al.39 regarding the extensiveness of the input points; however, concerning the errors, this research showed much better results than Baskin et al.'s study39. Unlike previous ML studies, our study includes the wavelength as a model input, taking another essential factor into account for refractive index prediction. Additionally, a wide range of temperatures was used in our dataset, giving the study an edge over the survey of Baskin et al.39. Figures 8 and 9 compare the present study's R² of the CatBoost model and the number of our data points with the pure ionic liquid studies from the literature, respectively; the superiority of the current research in terms of R² and the number of data points is evident in both figures. Figure 10 shows these two parameters together in one diagram. The number of data points indicates a study's comprehensiveness, while R² is a criterion for assessing its accuracy, so the upper right region of Fig. 10 is where the most accurate and comprehensive studies lie. While Baskin et al.39 utilized more data points in their research, their accuracy was lower; the best accuracy was obtained in the current study.
In addition to our six main models, we developed an auxiliary MLP model for a better comparison with the literature on pure ionic liquids, since several earlier studies used MLP models. Our MLP has two hidden layers with 4 and 2 neurons, respectively; the transfer functions are "tansig" for the first hidden layer, "logsig" for the second hidden layer, and "purelin" for the output layer. A comparison between the results of our MLP model and the literature is summarized in Table 7. A closer inspection of the table reveals that MLP models perform very well on smaller datasets. While it is difficult to compare all the error metrics fairly, as the literature does not fully report them, our MLP's results in Table 7 show generally acceptable errors. Compared with the other models, however, an MLP might not be as accurate on larger datasets.
Graphical analysis. Graphical analysis is provided to demonstrate the results in another unambiguous way. Figure 11 exhibits the deviation of the data from the ideal X = Y line; the closer the data lie to this line, the better the prediction. Visual inspection of Fig. 11 confirms the accuracy of the models and, in particular, the superior accuracy of the CatBoost model. The error distribution diagrams in Fig. 12 plot the relative error of the predicted refractive index against the experimental data. Again, the results are satisfactory, since most points have a relative error of less than 2%. A visual comparison confirms that the best model is CatBoost and the worst is Ada-SVM.
In addition, the relative errors of the four quartiles of each cation and anion family using the CatBoost model are presented in Figs. 13 and 14, respectively. The box and whisker diagrams show that the refractive index is predicted accurately across all quartiles of both the cation and anion families, as indicated by the low relative errors. The central marks on the boxes indicate the median; the boxes span the second and third quartiles, and the whiskers and outliers represent the first and fourth quartiles of the data. Outlier points are not visible in the figures. Figure 15 illustrates the dataset's mean absolute relative errors for each cation-anion family pair using the CatBoost model. Most ionic liquid families have low mean absolute relative errors; a lack of sufficient data points in some specific families resulted in higher, yet still acceptable, values. The distribution of relative error as a function of temperature using the CatBoost model is shown in Fig. 16. As expected, most data points lie near a relative error of zero: 743 data points fall in the temperature range of 293.15-298.15 K, with relative errors between -0.12% and 0.30%. The diagram also shows that very few data points have large relative errors, and even the largest relative error is less than 5%, which is acceptable. An additional informative diagram providing a suitable model comparison is shown in Fig. 17. As mentioned before, the CatBoost model outperforms the others in predicting the refractive index of ILs. The dashed line at an absolute relative error of 0.13% indicates that 90% of the data analyzed with our best model, CatBoost, have an absolute relative error of less than 0.13%. The diagram also confirms that the most accurate model was CatBoost, followed by XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM.

Sensitivity analysis.
Understanding the significance of each input on the output requires a sensitivity analysis of the results. Here, a relevancy factor was used, defined as follows58:

$$r = \frac{\sum_{i=1}^{N}\left(x_{v,i} - \bar{x}_{v}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_{v,i} - \bar{x}_{v}\right)^{2}\,\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^{2}}}$$

where $y_i$ and $\bar{y}$ are the i-th value and the average value of the predicted refractive index, respectively, and $x_{v,i}$ and $\bar{x}_{v}$ are the i-th value and the average value of the v-th input. The input with the highest absolute relevancy factor impacts the output the most. Figure 18 shows the complete set of model inputs and their corresponding relevancy factors for the CatBoost model; the colors and names correspond from top to bottom to avoid confusion. As the figure depicts, the strongest impacts on the refractive index come from the -F and >C< substructures, with relevancy factors of -0.75 and -0.47, respectively. The notation "with rings" in Fig. 18 means that the mentioned chemical substructure is located inside a ring in the chemical structure, as defined by Valderrama et al.56. A negative r factor of an input means it decreases the refractive index, and vice versa. The absolute values of these factors are collected in Table 8, which clearly indicates that temperature and wavelength are not among the first half of the most dominant factors influencing the refractive index: the relevance importance rank of the wavelength is 28th among 38 factors. Although it did not gain a high rank, the importance of the wavelength is not ruled out. The fact that the existence or quantity of certain substructures outranks the wavelength emphasizes the necessity of choosing appropriate materials, not a negligible role of wavelength. Once the material is chosen (i.e., all the substructures remain constant), the wavelength change shows its significant role.
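The relevancy factor is a Pearson-type coefficient and can be computed per input as in the minimal sketch below.

```python
import numpy as np

def relevancy_factor(x_v, y):
    """Relevancy factor r between the v-th input x_v and the output y."""
    x_v = np.asarray(x_v, dtype=float)
    y = np.asarray(y, dtype=float)
    num = np.sum((x_v - x_v.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x_v - x_v.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return num / den
```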
The Pearson correlation coefficient used here is especially useful when the data distribution is normal. Otherwise, it is advisable to calculate correlation coefficients from the ranks of the data instead of their actual values; the coefficients recommended for this purpose are Kendall's tau and Spearman's rho. Since some researchers suggest that Kendall's tau may support more accurate generalizations than Spearman's rho59, the absolute Kendall's tau is reported in Table 9. Kendall's tau formula in the case of tied ranks is as follows60:

$$\tau_B = \frac{N_a - N_d}{\sqrt{\left[\frac{N(N-1)}{2} - t_x\right]\left[\frac{N(N-1)}{2} - t_y\right]}}$$

where $\tau_B$ is Kendall's tau with tied-rank adjustments, $N_a$ is the number of agreements in order, $N_d$ is the number of disagreements in order, N is the number of data points, and $t_x$ and $t_y$ are the numbers of tied observations on the first and second variables, respectively. While some of the ranks changed compared to Table 8 due to the switch in correlation method, Table 9 consolidates the finding that the -F and >C< substructures are highly correlated with the refractive index.
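In practice, the tie-adjusted tau-b is available directly in SciPy; a short sketch on synthetic placeholder data:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical input: integer substructure counts (e.g. number of -F
# fragments) against a synthetic refractive index.
rng = np.random.default_rng(7)
x = rng.integers(0, 5, size=100)
y = 1.45 - 0.02 * x + 0.001 * rng.standard_normal(100)

tau_b, p_value = kendalltau(x, y)  # SciPy's default variant is tau-b
print(tau_b)
```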
Trend analysis. Understanding the influence of changing input values on the output is another illuminating way of analyzing the results. This insight can be gained from the trend analysis of the alkyl chain length, temperature, and wavelength. All parameters except the one under consideration are fixed, so that only the effect of the parameter in question is displayed. The CatBoost model was used in the trend analysis, as it is our most accurate model. Figure 19 shows the trend of the refractive index with respect to the number of carbons in the cation for three IL series, 1-alkyl-3-methylimidazolium tetrafluoroborate61, 1-alkyl-3-methylimidazolium hexafluorophosphate62, and 1-alkyl-3-methylimidazolium trifluoromethanesulfonate63, for both experimental and predicted data. According to Fig. 19, the refractive index rises with increasing alkyl chain length of the cation of imidazolium-based ionic liquids. This behavior is due to the variation of the molar refraction with the number of carbon atoms1.
The effect of temperature on the refractive index is demonstrated in Fig. 20 for 1-butyl-3-methylimidazolium tetrafluoroborate64, 1-ethyl-3-methylimidazolium acetate65, and tributylmethylphosphonium methyl sulfate66. Unlike the alkyl chain length, an increase in the temperature of an IL reduces its refractive index. This happens because, as the temperature rises, the density of the ionic liquid decreases, which increases the free molar volume; this increase in free molar volume causes the refractive index reduction1. This result is in accordance with the conclusions of the literature36. Finally, the effect of wavelength on the refractive index is illustrated in Fig. 21 for three chosen ILs: 1-ethyl-3-methylimidazolium tetrafluoroborate, 1-ethyl-3-methylimidazolium bis((trifluoromethyl)sulfonyl)imide, and 1-butyl-3-methylimidazolium trifluoromethanesulfonate1. As Fig. 21 shows, the refractive index declines as the wavelength increases.

Leverage approach. Leverage analysis is a method that can reveal outliers and the approximate range within which a prediction is likely to be accurate. Identifying the leverage points is essential because they might influence the prediction considerably. To determine the leverages of the inputs, the hat matrix is introduced as follows68:

$$H = X\left(X^{T}X\right)^{-1}X^{T}$$

The diagonal elements of the hat matrix are named leverages and satisfy:

$$0 \le h_{ii} \le 1$$

where $h_{ii}$ is the i-th diagonal element of the hat matrix. A threshold can be introduced as the upper limit of the standard values, usually defined as:

$$H^{*} = \frac{3(a+1)}{n}$$

where a is the number of inputs and n is the number of data points69. The following equation calculates the standardized residuals:

$$R_i = \frac{e_i}{\sqrt{\text{MSE}\left(1 - h_{ii}\right)}}$$

where MSE is the mean square error and $e_i$ is the ordinary residual of the i-th observation. As an accepted standard among researchers, if the absolute value of a data point's standardized residual is less than 3, the data point is considered valid; data outside this boundary are considered suspect70.
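The leverage quantities above are straightforward to compute; the minimal sketch below generates the inputs of a Williams plot, assuming X is the model input matrix and residuals are the ordinary residuals e_i.

```python
import numpy as np

def williams_inputs(X, residuals):
    """Return leverages h_ii, standardized residuals R_i and threshold H*."""
    X = np.asarray(X, dtype=float)
    e = np.asarray(residuals, dtype=float)
    H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
    h = np.diag(H)                                  # leverages, 0 <= h_ii <= 1
    mse = np.mean(e ** 2)
    R = e / np.sqrt(mse * (1.0 - h))                # standardized residuals
    h_star = 3 * (X.shape[1] + 1) / X.shape[0]      # H* = 3(a+1)/n
    return h, R, h_star
```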
With the leverages and standardized residuals available, the Williams diagram is plotted in Fig. 22. The lines R = 3 and R = -3 indicate the limits of the valid data, and the line Hat = 0.019 marks the limit beyond which data points have high leverages. With these criteria, 95 out of 6098 data points (roughly 1.5%) were detected as suspect data and 273 were good high-leverage points (approximately 4.5%), so the number of valid data points was 5730 (about 94% of the total). This finding shows that only a small portion of the data was unreasonable and that the CatBoost model's performance was impressive. Setting the y-axis and x-axis to display data in the ranges [-10, 10] and [0, 0.15], respectively, helps display the variation of the data points clearly. Four points have an R of less than -10, two points have an R of more than 10, and one point has a leverage value of more than 0.15. These seven points are not shown in Fig. 22, but the Supplementary Information provides the entire plot (Fig. S1). The dataset has an unusual point with a very high Hat value (around 0.33); it belongs to an ionic liquid with three -I substructures, unprecedented in the dataset. Because the leverage method only considers the inputs regardless of the outputs, this anomaly in the inputs escalates the leverage value.

Conclusions
This study aimed to predict the refractive index of a large number of ILs. As a novel approach, the wavelength and 36 chemical substructures were used as inputs, along with the temperature. More than 6000 data points were gathered and fed into six different chemical structure-based machine learning models, namely XGBoost, LightGBM, CatBoost, CNN, Ada-DT, and Ada-SVM, to achieve this study's aim. Statistical and visual analysis of the results reveals that the most accurate model was CatBoost. The other models also performed effectively and rank as XGBoost, LightGBM, Ada-DT, CNN, and Ada-SVM in order of accuracy. Further findings of the research are highlighted as follows:

• The sensitivity analysis showed that the -F and >C< substructures have the most influence on the predicted refractive index of an ionic liquid. The presence of these substructures in an ionic liquid decreases its refractive index.
• Apart from the type of IL, temperature has a stronger effect on the calculated refractive index than wavelength.
• Neither temperature nor wavelength was among the top 50% of the most influential inputs on the refractive index. The type of ionic liquid, more precisely the presence of certain chemical substructures, had more impact on the output than temperature and wavelength.
• The results of the leverage approach show that some points have uncommon leverage values, which could result from unusual chemical substructures in the ILs.
• Using machine learning methods to predict the refractive index of a vast number of ILs showed extraordinary performance: even our worst model was acceptable, with an R² of 0.9227 and an AAPRE of 0.6618, while our best model's statistical results were exceptional, with an R² of 0.9973 and an AAPRE of 0.0545.
• The trend analysis reveals that the refractive indices of ILs decline as wavelength and temperature rise, while the refractive indices of imidazolium-based ILs increase with increasing alkyl chain length.

Data availability
All data generated or analyzed during this study are included in this published article (and its Supplementary Information files).